pack-objects: integrate --path-walk and some --filter options#2101

Open
derrickstolee wants to merge 10 commits into gitgitgadget:en/backfill-fixes-and-edges from derrickstolee:path-walk-filters

@derrickstolee derrickstolee commented Apr 27, 2026

NOTE: This series is based on en/backfill-fixes-and-edges.

The 'git pack-objects' command has a '--path-walk' option that uses the path-walk API instead of a typical revision walk. This groups objects into chunks by path name instead of relying solely on name-hashes to group similar files together. (After a first compression pass that is focused within the chunks per path, it also does a second pass looking for better deltas.)

The '--path-walk' feature was not previously integrated with the '--filter' feature, so a warning would appear and disable the path-walk API when a filter is given. This patch series integrates these together in the following ways:

  • --filter=blob:none updates the path-walk API options to skip blobs.
  • --filter=blob:limit=<size> adds a scan to a list of blob objects to remove objects that are too large.
  • --filter=sparse:<oid> adds a scan to the chunks to validate that the paths match the sparse-checkout patterns.

In particular, this last check is significantly faster than the previous algorithm because it can check all objects at a given path simultaneously instead of checking all sparse-checkout patterns for each discovered blob object.

This requires a subtle change to how we mark an object as "seen" during the path-walk. The same object may need to be added to multiple paths, so we mark it as "seen" only once it has matched the sparse-checkout patterns and its path is accepted for emitting to the callback. This adds a new filtering step: objects already marked "seen" must also be removed from later chunks to avoid sending the same object in multiple chunks.

There's also a subtle detail here in that the path-walk API also prunes tree paths based on cone-mode sparse-checkouts, to enable 'git backfill --sparse' operating quickly for small sparse-checkout scopes. But the --filter=sparse:<oid> feature doesn't prune trees!

As a future step, I do plan to recommend that we add a treesparse:<oid> setting that does allow us to trim the tree set by cone-mode sparse patterns. At the time that partial clone filters were being created, cone mode sparse-checkout didn't exist and neither did the sparse index. Those features together make a smaller tree set possible, assuming the user never needs to change their scope. This would be a significant change so it is not implemented here, though the git pack-objects integration would be quick after this series completes.

Neither the sparse:<oid> option nor the hypothetical treesparse:<oid> option is, or necessarily should be, supported by Git servers. These filters are too expensive to compute dynamically and do not work well with reachability bitmaps. What this change makes possible is reasonably fast construction of bundles with these filters, which can bootstrap a working environment with the full history of all files within a given sparse-checkout scope.

Performance Results

Since the '--path-walk' option is ignored in today's Git version when a '--filter' is added, the performance matches the behavior without '--path-walk'. For the tables below, you can compare the rows against each other (time and then packfile size) for the mode without and then with '--path-walk' as a representation of "before" and "after". (These tables are repeated in the commit messages as new implementations improve specific rows.)

I chose a number of open source repositories of various sizes and shapes:

git/git

Test                                              HEAD
-------------------------------------------------------------------
5315.2: repack (no filter)                       27.73
5315.3: repack size (no filter)                 250.6M
5315.4: repack (no filter, --path-walk)          35.19
5315.5: repack size (no filter, --path-walk)    220.1M
5315.6: repack (blob:none)                       13.42
5315.7: repack size (blob:none)                 137.6M
5315.8: repack (blob:none, --path-walk)          20.98
5315.9: repack size (blob:none, --path-walk)    115.2M
5315.10: repack (sparse:oid)                     72.53
5315.11: repack size (sparse:oid)               187.5M
5315.12: repack (sparse:oid, --path-walk)        29.00
5315.13: repack size (sparse:oid, --path-walk)  161.0M

nodejs/node

Test                                              HEAD
--------------------------------------------------------------------
5315.2: repack (no filter)                       75.53
5315.3: repack size (no filter)                   0.9G
5315.4: repack (no filter, --path-walk)          80.54
5315.5: repack size (no filter, --path-walk)    885.7M
5315.6: repack (blob:none)                       12.65
5315.7: repack size (blob:none)                 148.6M
5315.8: repack (blob:none, --path-walk)          17.60
5315.9: repack size (blob:none, --path-walk)    134.6M
5315.10: repack (sparse:oid)                    518.84
5315.11: repack size (sparse:oid)               153.4M
5315.12: repack (sparse:oid, --path-walk)        27.99
5315.13: repack size (sparse:oid, --path-walk)  139.4M

microsoft/fluentui

Test                                              HEAD
--------------------------------------------------------------------
5315.2: repack (no filter)                      146.77
5315.3: repack size (no filter)                 562.1M
5315.4: repack (no filter, --path-walk)          72.82
5315.5: repack size (no filter, --path-walk)    172.6M
5315.6: repack (blob:none)                        4.84
5315.7: repack size (blob:none)                  62.7M
5315.8: repack (blob:none, --path-walk)           5.19
5315.9: repack size (blob:none, --path-walk)     59.9M
5315.10: repack (sparse:oid)                     59.95
5315.11: repack size (sparse:oid)                85.6M
5315.12: repack (sparse:oid, --path-walk)        15.16
5315.13: repack size (sparse:oid, --path-walk)   72.5M

microsoftdocs/azure-devops-docs

Test                                               HEAD
---------------------------------------------------------------------
5315.2: repack (no filter)                        4.41
5315.3: repack size (no filter)                   1.6G
5315.4: repack (no filter, --path-walk)           6.00
5315.5: repack size (no filter, --path-walk)      1.6G
5315.6: repack (blob:none)                        1.35
5315.7: repack size (blob:none)                  60.0M
5315.8: repack (blob:none, --path-walk)           1.23
5315.9: repack size (blob:none, --path-walk)     60.0M
5315.10: repack (sparse:oid)                    138.24
5315.11: repack size (sparse:oid)                84.4M
5315.12: repack (sparse:oid, --path-walk)         1.86
5315.13: repack size (sparse:oid, --path-walk)   84.4M

Performance conclusions

As seen in earlier series around the '--path-walk' feature, the space savings can be valuable but are not guaranteed. When the space savings do not materialize, the run is generally slower because of the two-pass mechanism. The microsoftdocs/azure-devops-docs repo demonstrates this case quite clearly.

However, even in these cases the 'sparse:<oid>' filters are much faster because of the ability to check an entire set of objects against the sparse-checkout patterns only once.

Thanks,
-Stolee

UPDATES IN V2

  • Rebased onto en/backfill-fixes-and-edges to properly integrate with the incompatible rev-list options logic in that series.
  • Updated documentation as behavior changes. Credit to Taylor Blau for presenting these suggestions in his RFC [2].
  • Added three patches of Taylor's to extend more filter options.

P.S. I've CC'd the folks who were on the original path-walk feature thread [1]

[1] https://lore.kernel.org/git/pull.1819.git.1741571455.gitgitgadget@gmail.com/

[2] https://lore.kernel.org/git/cover.1777853408.git.me@ttaylorr.com/

cc: christian.couder@gmail.com
cc: gitster@pobox.com
cc: johannes.schindelin@gmx.de
cc: johncai86@gmail.com
cc: karthik.188@gmail.com
cc: kristofferhaugsbakk@fastmail.com
cc: me@ttaylorr.com
cc: newren@gmail.com
cc: peff@peff.net
cc: ps@pks.im

@derrickstolee derrickstolee force-pushed the path-walk-filters branch 3 times, most recently from 6fb9f3e to 859bee3 on May 1, 2026 20:12
@derrickstolee derrickstolee marked this pull request as ready for review May 2, 2026 14:14
@derrickstolee

/submit

gitgitgadget Bot commented May 2, 2026

Submitted as pull.2101.git.1777731354.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2101/derrickstolee/path-walk-filters-v1

To fetch this version to local tag pr-2101/derrickstolee/path-walk-filters-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2101/derrickstolee/path-walk-filters-v1

Comment thread builtin/backfill.c

repo_init_revisions(repo, &ctx.revs, prefix);
argc = setup_revisions(argc, argv, &ctx.revs, NULL);


Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <stolee@gmail.com>
>
> The 'git backfill' command uses the path-walk API in a critical way: it
> uses the objects output from the command to find the batches of missing
> objects that should be requested from the server. Unlike 'git
> pack-objects', we cannot fall back to another mechanism.
>
> The previous change added the path_walk_filter_compatible() method that
> we can reuse here. Use it during argument validation in cmd_backfill().
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  builtin/backfill.c  | 2 ++
>  t/t5620-backfill.sh | 8 ++++++++
>  2 files changed, 10 insertions(+)

Another topic adds a helper function to check for many incompatible
options and calls it from here.  When I merged this topic, I made a
semi-evil merge to move this call to that function (with necessary
adjustment to the parameter).  Please sanity check the resolution I
made in 'seen'.  Thanks.


> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index d794dd842f..51eaa42169 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -144,6 +144,8 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>  
>  	if (argc > 1)
>  		die(_("unrecognized argument: %s"), argv[1]);
> +	if (!path_walk_filter_compatible(&ctx.revs.filter))
> +		die(_("cannot backfill with these filter options"));
>  
>  	repo_config(repo, git_default_config, NULL);
>  
> diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
> index f3b5e39493..3580e10b9c 100755
> --- a/t/t5620-backfill.sh
> +++ b/t/t5620-backfill.sh
> @@ -15,6 +15,14 @@ test_expect_success 'backfill rejects unexpected arguments' '
>  	test_grep "unrecognized argument: --unexpected-arg" err
>  '
>  
> +test_expect_success 'backfill rejects incompatible filter options' '
> +	test_must_fail git backfill --objects --filter=tree:1 2>err &&
> +	test_grep "cannot backfill with these filter options" err &&
> +
> +	test_must_fail git backfill --objects --filter=blob:limit=10m 2>err &&
> +	test_grep "cannot backfill with these filter options" err
> +'
> +
>  # We create objects in the 'src' repo.
>  test_expect_success 'setup repo for object creation' '
>  	echo "{print \$1}" >print_1.awk &&


Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 5/3/2026 6:59 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The 'git backfill' command uses the path-walk API in a critical way: it
>> uses the objects output from the command to find the batches of missing
>> objects that should be requested from the server. Unlike 'git
>> pack-objects', we cannot fall back to another mechanism.
>>
>> The previous change added the path_walk_filter_compatible() method that
>> we can reuse here. Use it during argument validation in cmd_backfill().
>>
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> ---
>>  builtin/backfill.c  | 2 ++
>>  t/t5620-backfill.sh | 8 ++++++++
>>  2 files changed, 10 insertions(+)
> 
> Another topic adds a helper function to check for many incompatible
> options and calls it from here.  When I merged this topic, I made an
> semi-evil merge to move this call to that function (with necessary
> adjustment to the parameter).  Please sanity check the resolution I
> made in 'seen'.  Thanks.

Thanks for alerting me about this. I do think there is an error in
your merge. Hopefully the tests I wrote in this series caught the
mistake.

Here are the last lines in your copy of
reject_unsupported_rev_list_options():

	if (revs->filter.choice)
		die(_("'%s' cannot be used with 'git backfill'"),
		    "--filter");
	if (!path_walk_filter_compatible(&revs->filter))
		die(_("cannot backfill with these filter options"));
	if (revs->filter.blob_limit_value)
		die(_("cannot backfill with blob size limits"));

The last two options are correct, but they can't do anything
because the first one causes the command to fail immediately.
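
The shadowing described above can be modeled in a small standalone sketch. The types and function names below are hypothetical stand-ins rather than Git's real structures, and the assumption that only a tree-depth filter is incompatible is a simplification. The second function shows one possible ordering without the blanket check; the thread does not state how the merge was actually corrected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not Git's real types. */
enum sketch_choice {
	SKETCH_NO_FILTER = 0,
	SKETCH_BLOB_LIMIT,
	SKETCH_TREE_DEPTH,
	SKETCH_SPARSE_OID
};

struct sketch_filter {
	enum sketch_choice choice;
	unsigned long blob_limit_value;
};

enum sketch_result {
	SKETCH_OK = 0,
	SKETCH_ERR_ANY_FILTER,   /* "'--filter' cannot be used with 'git backfill'" */
	SKETCH_ERR_INCOMPATIBLE, /* "cannot backfill with these filter options" */
	SKETCH_ERR_BLOB_LIMIT    /* "cannot backfill with blob size limits" */
};

/* Assumption for this sketch: only the tree-depth filter is incompatible. */
int sketch_compatible(const struct sketch_filter *f)
{
	assert(f != NULL);
	return f->choice != SKETCH_TREE_DEPTH;
}

/* Same ordering as the quoted merge result: the blanket check comes first. */
enum sketch_result sketch_merged_order(const struct sketch_filter *f)
{
	if (f->choice)
		return SKETCH_ERR_ANY_FILTER;
	if (!sketch_compatible(f))
		return SKETCH_ERR_INCOMPATIBLE; /* never reached for a real filter */
	if (f->blob_limit_value)
		return SKETCH_ERR_BLOB_LIMIT;   /* never reached for a real filter */
	return SKETCH_OK;
}

/* Hypothetical alternative: only the two specific checks remain. */
enum sketch_result sketch_specific_only(const struct sketch_filter *f)
{
	if (!sketch_compatible(f))
		return SKETCH_ERR_INCOMPATIBLE;
	if (f->blob_limit_value)
		return SKETCH_ERR_BLOB_LIMIT;
	return SKETCH_OK;
}
```

In this model every non-empty filter produces the blanket error under the merged ordering, which is consistent with the 'fatal:' line in the test output quoted later in this message.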

This should have caused failures in t5620-backfill.sh, specifically
the test 'backfill rejects incompatible filter options'. Indeed, I
get this error output when running on that commit:

+ test_grep cannot backfill with these filter options err
+ eval last_arg=${2}
+ last_arg=err
+ test -f err
+ test 2 -lt 2
+ test x! = xcannot backfill with these filter options
+ test x! = xcannot backfill with these filter options
+ grep cannot backfill with these filter options err
+ echo error: 'grep cannot backfill with these filter options err' didn't find a match in:
error: 'grep cannot backfill with these filter options err' didn't find a match in:
+ test -s err
+ cat err
fatal: '--filter' cannot be used with 'git backfill'
+ return 1

This does make it clear that I should add a new test in t5620 that
tests the 'sparse:<oid>' filter now that it is compatible, which I
missed in v1.

Thanks,
-Stolee

gitgitgadget Bot commented May 4, 2026

This branch is now known as ds/path-walk-filters.

gitgitgadget Bot commented May 4, 2026

This patch series was integrated into seen via git@1942770.

@gitgitgadget gitgitgadget Bot added the seen label May 4, 2026
Comment thread builtin/pack-objects.c
@@ -5190,10 +5190,7 @@ int cmd_pack_objects(int argc,
}

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <stolee@gmail.com>
>
> When 'git pack-objects' has the --path-walk option enabled, it uses a
> different set of revision walk parameters than normal. For once,

"once" -> "one" (or "instance")?

> --objects was previously assumed by the path-walk API and was not needed
> to be added. We also needed --boundary to allow discovering
> UNINTERESTING objects to use as delta bases.
>
> We will be updating the path-walk API soon to work with some filter
> options. However, the revision machinery will trigger a fatal error:
>
>   fatal: object filtering requires --objects
>
> The fix is easy: add the --objects option as an argument. This has no
> effect on the path-walk API but does simplify the revision option
> parsing for the objects filter.
>
> We can remove the comment about "removing" the options because they were
> never removed and instead not added. We still need to disable using
> bitmaps.

In the old code, there was a valid reason why bitmaps were not used
(i.e., "--objects" not enabled), but that no longer holds (i.e., now
we add "--objects" ourselves).  Do we need to give an updated
rationale to keep bitmap disabled?

> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  builtin/pack-objects.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index dd2480a73d..4338962904 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -5190,10 +5190,7 @@ int cmd_pack_objects(int argc,
>  	}
>  	if (path_walk) {
>  		strvec_push(&rp, "--boundary");
> -		 /*
> -		  * We must disable the bitmaps because we are removing
> -		  * the --objects / --objects-edge[-aggressive] options.
> -		  */
> +		strvec_push(&rp, "--objects");
>  		use_bitmap_index = 0;
>  	} else if (thin) {
>  		use_internal_rev_list = 1;


Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 5/3/2026 8:49 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> When 'git pack-objects' has the --path-walk option enabled, it uses a
>> different set of revision walk parameters than normal. For once,
> 
> "once" -> "one" (or "instance")?

Yes, "one". Sorry for the typo.

>> --objects was previously assumed by the path-walk API and was not needed
>> to be added. We also needed --boundary to allow discovering
>> UNINTERESTING objects to use as delta bases.
>>
>> We will be updating the path-walk API soon to work with some filter
>> options. However, the revision machinery will trigger a fatal error:
>>
>>   fatal: object filtering requires --objects
>>
>> The fix is easy: add the --objects option as an argument. This has no
>> effect on the path-walk API but does simplify the revision option
>> parsing for the objects filter.
>>
>> We can remove the comment about "removing" the options because they were
>> never removed and instead not added. We still need to disable using
>> bitmaps.
> 
> In the old code, there was a valid reason why bitmaps were not used
> (i.e., "--objects" not enabled), but that no longer holds (i.e., now
> we add "--objects" ourselves).  Do we need to give an updated
> rationale to keep bitmap disabled?

>>  	if (path_walk) {
>>  		strvec_push(&rp, "--boundary");
>> -		 /*
>> -		  * We must disable the bitmaps because we are removing
>> -		  * the --objects / --objects-edge[-aggressive] options.
>> -		  */
>> +		strvec_push(&rp, "--objects");
>>  		use_bitmap_index = 0;
>>  	} else if (thin) {

This old comment is perhaps confusing things. The important thing here
is to disable bitmaps with 'use_bitmap_index = 0;' (though perhaps not
for long [1]).

[1] https://lore.kernel.org/git/f50f8df01a9f216d5b4388b2fe4ff58077b574f3.1777853408.git.me@ttaylorr.com/

The path-walk API itself disables the objects walk for the revision
machinery in walk_objects_by_path():

	info->revs->blob_objects = info->revs->tree_objects = 0;

This allows the path-walk API to rely on the revision walk for a
_commits only_ walk and then have the path-walk API handle the trees
and blobs.

The reason we need to add "--objects" now is to allow for parsing the
"--filter" option without the revision logic complaining.

Thanks,
-Stolee

derrickstolee and others added 10 commits May 4, 2026 12:24
When 'git pack-objects' has the --path-walk option enabled, it uses a
different set of revision walk parameters than normal. For one,
--objects was previously assumed by the path-walk API and was not needed
to be added. We also needed --boundary to allow discovering
UNINTERESTING objects to use as delta bases.

We will be updating the path-walk API soon to work with some filter
options. However, the revision machinery will trigger a fatal error:

  fatal: object filtering requires --objects

The fix is easy: add the --objects option as an argument. This has no
effect on the path-walk API but does simplify the revision option
parsing for the objects filter.

We can remove the comment about "removing" the options because they were
never removed and instead not added. We still need to disable using
bitmaps.

Signed-off-by: Derrick Stolee <stolee@gmail.com>

Add p5315-pack-objects-filter.sh to measure the performance of
'git pack-objects --revs --all' under different filter and traversal
combinations:

 * no filter (baseline)
 * --filter=blob:none (blobless)
 * --filter=sparse:oid=<oid> (cone-mode sparse)

Each filter scenario is tested both with and without --path-walk,
producing paired measurements that show the impact of the path-walk
traversal for each filter type as we integrate the --path-walk feature
with different --filter options. It currently has no integration so
falls back to the standard revision walk. Thus, there are no significant
differences in the current results other than a full repack (and even
then, the --path-walk feature is not incredibly different for the
default Git repository):

Test                                             HEAD
-----------------------------------------------------
5315.2: repack (no filter)                      27.91
5315.3: repack size (no filter)                250.7M
5315.4: repack (no filter, --path-walk)         34.92
5315.5: repack size (no filter, --path-walk)   220.0M
5315.6: repack (blob:none)                      13.63
5315.7: repack size (blob:none)                137.6M
5315.8: repack (blob:none, --path-walk)         13.48
5315.9: repack size (blob:none, --path-walk)   137.7M
5315.10: repack (sparse:oid)                    72.67
5315.11: repack size (sparse:oid)              187.4M
5315.12: repack (sparse:oid, --path-walk)       72.47
5315.13: repack size (sparse:oid, --path-walk) 187.4M

The sparse filter definition is built automatically by sampling
depth-2 directories from the test repository, making the test work
on any repo passed via GIT_PERF_LARGE_REPO. For repos that lack
depth-2 directories, a single top-level directory is used; for flat
repos, the sparse tests are skipped via prerequisite.

Signed-off-by: Derrick Stolee <stolee@gmail.com>

The 'git pack-objects' command can opt-in to using the path-walk API for
scanning the objects. Currently, this option is dynamically disabled if
combined with '--filter=<X>', even when using a simple filter such as
'blob:none' to signal a blobless packfile. This is a common scenario for
repos at scale, so is worth integrating.

Also, users can opt-in to the '--path-walk' option by default through
the pack.usePathWalk=true config option. When using that in a blobless
partial clone, the following warning can appear even though the user did
not specify either option directly:

  warning: cannot use --filter with --path-walk

Teach the path-walk API to handle the 'blob:none' object filter
natively. When revs->filter.choice is LOFC_BLOB_NONE, the path-walk
sets info->blobs to 0 (skipping all blob objects) and clears the
filter from revs so that prepare_revision_walk() does not reject the
configuration.

This check is implemented in the static prepare_filters() method, which
will simultaneously check if the input filters are compatible and will
make the appropriate mutations to the path_walk_info and filters if the
path_walk_info is non-NULL. This allows us to use this logic both in the
API method path_walk_filter_compatible() for use in
builtin/pack-objects.c and as a prep step in walk_objects_by_path().
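
The dual role described above, where one helper both answers "is this filter compatible?" and applies the mutations only when it is handed a walk context, can be sketched in isolation. Every name below is hypothetical and the filter set is reduced to 'blob:none' and "anything else"; this is a minimal model of the pattern, not the code from this series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the filter and walk context. */
enum sketch_filter_choice {
	SKETCH_NO_FILTER = 0,
	SKETCH_BLOB_NONE,
	SKETCH_OTHER_FILTER
};

struct sketch_filter {
	enum sketch_filter_choice choice;
};

struct sketch_walk_info {
	int blobs; /* nonzero: the walk emits blob objects */
};

/*
 * Returns 1 if the filter can be handled natively, 0 otherwise.
 * When 'info' is non-NULL, also applies the filter to the walk
 * context and clears it; when 'info' is NULL, nothing is modified.
 */
static int sketch_prepare_filters(struct sketch_walk_info *info,
				  struct sketch_filter *filter)
{
	assert(filter != NULL);

	switch (filter->choice) {
	case SKETCH_NO_FILTER:
		return 1;
	case SKETCH_BLOB_NONE:
		if (info) {
			info->blobs = 0;
			filter->choice = SKETCH_NO_FILTER;
		}
		return 1;
	default:
		return 0;
	}
}

/* Check-only entry point, as a command would use during validation. */
int sketch_filter_compatible(struct sketch_filter *filter)
{
	return sketch_prepare_filters(NULL, filter);
}

/* Preparation step at the start of a walk: check and mutate. */
int sketch_walk_prepare(struct sketch_walk_info *info,
			struct sketch_filter *filter)
{
	assert(info != NULL);
	return sketch_prepare_filters(info, filter);
}
```

Keeping both paths in one function means the compatibility answer cannot drift from what the walk actually implements, at the cost of a nullable parameter.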

Update the test helper (test-path-walk) to accept --filter=<spec>
as a test-tool option (before '--'), applying it to revs after
setup_revisions() to avoid the --objects requirement check.

Also switch test-path-walk from REV_INFO_INIT with manual repo
assignment to repo_init_revisions(), which properly initializes
the filter_spec strbuf needed for filter parsing.

Add tests for blob:none with --all and with a single branch.

The performance test p5315 shows the impact of this change when using
blobless filters:

Test                                           HEAD~1     HEAD
---------------------------------------------------------------------
5315.6: repack (blob:none)                      13.53   13.87  +2.5%
5315.7: repack size (blob:none)                137.7M  137.8M  +0.1%
5315.8: repack (blob:none, --path-walk)         13.51   23.43 +73.4%
5315.9: repack size (blob:none, --path-walk)   137.7M  115.2M -16.3%

These performance tests were run on the Git repository. The --path-walk
feature shows meaningful space savings (16% smaller for blobless packs)
at the cost of increased computation time due to the two compression
passes. This data demonstrates that the feature is engaged and provides
real compression benefits when --no-reuse-delta forces fresh deltas.

Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>

The 'git backfill' command uses the path-walk API in a critical way: it
uses the objects output from the command to find the batches of missing
objects that should be requested from the server. Unlike 'git
pack-objects', we cannot fall back to another mechanism.

The previous change added the path_walk_filter_compatible() method that
we can reuse here. Use it during argument validation in cmd_backfill().

Signed-off-by: Derrick Stolee <stolee@gmail.com>

Extend the path-walk API to handle the 'blob:limit=<size>' object
filter natively. This filter omits blobs whose size is equal to or
greater than the given limit, matching the semantics used by the
list-objects-filter machinery.

When revs->filter.choice is LOFC_BLOB_LIMIT, the prepare_filters()
method stores the limit value in info->blob_limit and clears the filter
from revs. If the limit is zero, this degenerates to blob:none (all
blobs excluded), so info->blobs is set to 0 instead.

During walk_path(), blob batches are filtered before being delivered to
the callback: each blob's size is checked via odb_read_object_info(),
and only blobs strictly smaller than the limit are included. Blobs whose
size cannot be determined (e.g. missing in a partial clone) are
conservatively included, matching the existing filter behavior. Empty
batches after filtering are skipped entirely.

The check for inclusion in the path batch looks a little strange at
first glance. We use odb_read_object_info() to read the object's size.
Based on all of the assumptions to this point, this _should_ return
OBJ_BLOB. Since we are focused on the size filter, we use a
short-circuited OR (||) to skip the size check if that method returns a
different object type.
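
The inclusion rule described in the two paragraphs above can be illustrated with a standalone sketch. The object model is hypothetical: a 'type' field stands in for the return value of the object lookup, with an "unknown" value modeling a lookup that did not report a blob. The batch is compacted in place, and a result of zero corresponds to a batch the caller would skip.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object model; not Git's real object types. */
enum sketch_type {
	SKETCH_UNKNOWN = -1, /* lookup did not report a usable type */
	SKETCH_TREE = 2,
	SKETCH_BLOB = 3
};

struct sketch_object {
	enum sketch_type type;
	unsigned long size; /* only meaningful when type is SKETCH_BLOB */
};

/*
 * Keep the object if the lookup did not say "blob" (the short-circuit
 * skips the size check), or if it is a blob strictly smaller than the
 * limit. Blobs at or above the limit are dropped.
 */
int sketch_keep_in_batch(const struct sketch_object *obj, unsigned long limit)
{
	return obj->type != SKETCH_BLOB || obj->size < limit;
}

/* Compacts 'batch' in place and returns how many objects were kept. */
size_t sketch_filter_batch(struct sketch_object *batch, size_t nr,
			   unsigned long limit)
{
	size_t i, kept = 0;

	assert(batch != NULL || nr == 0);

	for (i = 0; i < nr; i++)
		if (sketch_keep_in_batch(&batch[i], limit))
			batch[kept++] = batch[i];
	return kept;
}
```

With a limit of 3, blobs of size 3 or 5 are dropped while a blob of size 2 and an object of unknown type are kept, matching the "equal to or greater than" semantics stated earlier in this commit message.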

Notice that this inspection of object sizes requires the content to be
present in the repository. The odb_read_object_info() call will download
a missing blob on-demand. This means that the use of the path-walk API
within 'git backfill' would not operate nicely with this filter type.
The intention of that command is to download missing blobs in batches.
Downloading objects one-by-one would go against the point. Update the
validation in 'git backfill' to add its own compatibility check on top
of path_walk_filter_compatible().

Add tests for blob:limit=0 (equivalent to blob:none) and blob:limit=3
(which exercises partial filtering within a batch where some blobs are
kept and others are excluded).

Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>

The path-walk API prunes trees and blobs when a sparse-checkout pattern
list is provided, which is the correct behavior for 'git backfill
--sparse' since it only needs to fill in objects at paths within the
sparse cone.

However, a future change will use the path-walk API with a sparse:<oid>
filter that restricts only blobs while retaining all reachable trees.
To support both behaviors, add a 'pl_sparse_trees' flag to
path_walk_info. When set (as in 'git backfill --sparse' and the
--stdin-pl test helper mode), the sparse patterns prune both trees and
blobs. When unset, only blobs are filtered and all trees are walked and
reported.

Additionally, move the SEEN flag assignment in add_tree_entries() to
after the sparse pattern and pathspec checks. Previously, SEEN was set
immediately upon discovering an object, before checking whether its path
matched the sparse patterns. When the same object ID appeared at
multiple paths (e.g. sibling directories with identical contents), the
first path to be visited would mark the object as SEEN. If that path was
outside the sparse cone, the object would be skipped there but also
never discovered at its in-cone path.

By deferring the SEEN flag until after the checks pass, objects that are
skipped due to sparse filtering remain discoverable at other paths where
they may be in scope.
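
The effect of moving the SEEN assignment can be shown with a toy model. In this sketch, which is an illustration and not the code in add_tree_entries(), each entry is an (object, path) pair reduced to a small integer ID and a flag saying whether the path matches the sparse patterns. The 'defer_seen' parameter is hypothetical and switches between the old and new ordering.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_OBJECTS 8

/* One tree entry discovered during the walk (hypothetical model). */
struct sketch_entry {
	int oid;     /* small integer standing in for an object ID */
	int in_cone; /* 1 if this entry's path matches the sparse patterns */
};

/*
 * Visits entries in order and returns how many objects are emitted.
 * defer_seen == 0: mark SEEN on discovery, before the path check.
 * defer_seen != 0: mark SEEN only after the path check passes.
 */
size_t sketch_walk(const struct sketch_entry *entries, size_t nr,
		   int defer_seen)
{
	int seen[SKETCH_MAX_OBJECTS] = { 0 };
	size_t i, emitted = 0;

	for (i = 0; i < nr; i++) {
		int oid = entries[i].oid;

		assert(oid >= 0 && oid < SKETCH_MAX_OBJECTS);

		if (seen[oid])
			continue;
		if (!defer_seen)
			seen[oid] = 1; /* old ordering */
		if (!entries[i].in_cone)
			continue;      /* skipped by the sparse patterns */
		seen[oid] = 1;         /* new ordering: path was accepted */
		emitted++;
	}
	return emitted;
}
```

When the same object appears first at an out-of-cone path and then at an in-cone path, the old ordering emits nothing while the deferred ordering emits it once. An object that appears at two in-cone paths is still emitted only once in either mode.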

Signed-off-by: Derrick Stolee <stolee@gmail.com>

The --filter=sparse:<oid> option to 'git pack-objects' allows focusing
an object set to a sparse-checkout definition. This reduces the set of
matching blobs while retaining all reachable trees. No server currently
supports fetching with this filter because it is expensive to compute
and reachability bitmaps do not help without a significant effort to
extend the bitmap feature to store bitmaps for each supported sparse-
checkout definition.

Without focusing on serving fetches and clones with these filters, there
are still benefits that could be realized by making this faster. With
the sparse index, it's more realistic now than ever to be able to
operate a local clone that was bootstrapped by a packfile created with
a sparse filter, because the missing trees are not needed to move a
sparse-checkout from one commit to another or to view the history of any
path in scope. Such clones could perhaps be bootstrapped by partial
bundles.

Previously, constructing these sparse packs has been incredibly
computationally inefficient. The revision walk that explores which
objects are in scope spends a lot of time checking each object to see if
it matches the sparse-checkout patterns, causing quadratic behavior
(number of objects times number of sparse-checkout patterns). This
improves somewhat when using cone-mode sparse-checkout patterns that can
use hashtables and prefix matches to determine containment. However, the
check per object is still too expensive for most cases.

This is where the path-walk feature comes in. We can proceed as normal
by placing objects in bins by path and _then_ check a group of objects
all at once. Since sparse:<oid> only restricts blobs, the path-walk must
include all reachable trees while using the cone-mode patterns to skip
blobs at paths outside the sparse scope. This establishes a baseline for
a potential future "treesparse:<oid>" filter that would also restrict
trees, but introducing such a new filter is deferred to a later change.
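
To make the cost difference concrete, the following sketch counts pattern comparisons when the cone check runs once per object versus once per path batch. It is a deliberately simplified model: the cone is a plain list of directory prefixes scanned linearly (the commit message notes that real cone-mode matching can use hashtables), and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if 'path' is 'dir' itself or lies underneath it. */
int sketch_path_in_dir(const char *path, const char *dir)
{
	size_t len = strlen(dir);

	return !strncmp(path, dir, len) &&
	       (path[len] == '/' || path[len] == '\0');
}

/* Linear scan over cone directories; '*checks' counts comparisons. */
int sketch_path_in_cone(const char *path, const char *const *dirs,
			size_t nr_dirs, size_t *checks)
{
	size_t i;

	assert(checks != NULL);

	for (i = 0; i < nr_dirs; i++) {
		(*checks)++;
		if (sketch_path_in_dir(path, dirs[i]))
			return 1;
	}
	return 0;
}

/* All object versions found at one path (hypothetical model). */
struct sketch_path_batch {
	const char *path;
	size_t nr_objects;
};

/* Revision-walk style: the pattern check is repeated for every object. */
size_t sketch_checks_per_object(const struct sketch_path_batch *batches,
				size_t nr, const char *const *dirs,
				size_t nr_dirs)
{
	size_t i, j, checks = 0;

	for (i = 0; i < nr; i++)
		for (j = 0; j < batches[i].nr_objects; j++)
			(void)sketch_path_in_cone(batches[i].path, dirs,
						  nr_dirs, &checks);
	return checks;
}

/* Path-walk style: one pattern check decides the whole batch. */
size_t sketch_checks_per_path(const struct sketch_path_batch *batches,
			      size_t nr, const char *const *dirs,
			      size_t nr_dirs)
{
	size_t i, checks = 0;

	for (i = 0; i < nr; i++)
		(void)sketch_path_in_cone(batches[i].path, dirs, nr_dirs,
					  &checks);
	return checks;
}
```

In this model the per-object count grows with the number of object versions at each path, while the per-path count depends only on the number of distinct paths and patterns.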

The implementation here is focused around loading the sparse-checkout
patterns from the provided object ID and checking that the patterns are
indeed cone-mode patterns. We can then load the correct pattern list
into the path walk context and use the logic that already exists from
bff4555 (backfill: add --sparse option, 2025-02-03), though that
feature loads sparse-checkout patterns from the worktree's local
settings and also restricts tree objects. We use a combination of errors
and warnings to signal problems during this load. Errors mark problems
that are likely fatal for the non-path-walk version as well, while
warnings mark limitations that are probably specific to the path-walk
implementation, in which case the 'git pack-objects' command can fall
back to the revision walk version.

Now that the SEEN flag is deferred until after pattern checks (from the
previous commit), handle the case where a tree with a shared OID appears
at both an out-of-cone and in-cone path. When trees are not being pruned
(pl_sparse_trees == 0), the path-walk re-walks the tree at the in-cone
path so that in-cone blobs within it are discovered. The new tests in
t5317 and t6601 demonstrate this behavior and would fail without these
changes.
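
The effect of deferring the SEEN flag can be modeled with a toy walk.
The sketch below is an assumption-laden illustration, not the path-walk
code: `sketch_object`, `sketch_visit`, and `sketch_emit()` are
hypothetical, and a "visit" stands for one object showing up at one
path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object with only the flag this sketch cares about. */
struct sketch_object {
	int seen;
};

/* One appearance of an object at a path, in walk order. */
struct sketch_visit {
	struct sketch_object *obj;
	int in_cone;
};

/*
 * Walk the visits in order and return how many objects are emitted.
 * With 'defer_seen' set, an object is marked seen only when it is
 * emitted at an in-cone path; otherwise it is marked on first visit.
 */
size_t sketch_emit(const struct sketch_visit *visits, size_t nr,
		   int defer_seen)
{
	size_t i, emitted = 0;

	assert(visits || !nr);
	for (i = 0; i < nr; i++) {
		struct sketch_object *obj = visits[i].obj;

		if (obj->seen)
			continue;
		if (!defer_seen)
			obj->seen = 1; /* eager: marked even if rejected */
		if (!visits[i].in_cone)
			continue;
		obj->seen = 1;
		emitted++;
	}
	return emitted;
}
```

When the same object is visited first at an out-of-cone path and then at
an in-cone path, the eager variant never emits it, while the deferred
variant emits it exactly once.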

The performance test p5315 shows the impact of this change when using
sparse filters:

Test                                              HEAD~1     HEAD
----------------------------------------------------------------------
5315.10: repack (sparse:oid)                      77.98    77.47  -0.7%
5315.11: repack size (sparse:oid)                187.5M   187.4M  -0.0%
5315.12: repack (sparse:oid, --path-walk)         77.91    31.41 -59.7%
5315.13: repack size (sparse:oid, --path-walk)   187.5M   161.1M -14.1%

These performance tests were run on the Git repository. The --path-walk
feature shows meaningful space savings (14% smaller for sparse packs)
and dramatic time savings (60% faster) by leveraging the path-walk's
ability to skip blobs outside the sparse scope.

Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
The `tree:0` object filter omits all trees and blobs from the result,
keeping only commits and tags. Consequently, this filter type has a
fairly straightforward integration with path-walk, as the decision
to include an object depends only on its type and does not depend on any
path-sensitive state.

Mapping it onto `path_walk_info` is direct: set `info->trees = 0` and
`info->blobs = 0` in `prepare_filters()` when the `LOFC_TREE_DEPTH`
choice is requested with depth zero. The existing code already plumbs
those flags through the rest of the walk:

 - `walk_objects_by_path()` sets `revs->blob_objects = info->blobs` and
   `revs->tree_objects = info->trees` before `prepare_revision_walk()`,
   so the revision walk doesn't try to enumerate trees or blobs itself.

 - The commit-walk loop short-circuits the root-tree fetch with
   "if (!info->trees && !info->blobs) continue;", so we never even
   look up the root tree, let alone descend into it.

 - `setup_pending_objects()` skips pending trees and blobs based on
   the same flags.

This means the path-walk doesn't allocate or expand any tree structures
at all under `tree:0`, which matches the intended behavior of the
filter.

Non-zero tree-depth filters are not supported. Those depend on the depth
at which a tree is visited, which is a path-walk concept the filter
machinery doesn't currently share with the path-walk API. Reject them in
`prepare_filters()` with a helpful error and let pack-objects fall back
to the regular traversal, the same way it already does for unsupported
filters.
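
As a rough sketch of the mapping described above, assuming a
hypothetical `sketch_walk_info` struct that carries only the four type
flags (the real `path_walk_info` has more fields, and the real handler
lives in `prepare_filters()`):

```c
#include <assert.h>

/* Hypothetical stand-in for the per-type flags in path_walk_info. */
struct sketch_walk_info {
	int commits, trees, blobs, tags;
};

/*
 * Apply a tree-depth filter. Only depth zero maps onto the flags;
 * any other depth is reported as unsupported (-1) and leaves 'info'
 * untouched so the caller can fall back to the regular traversal.
 */
int sketch_apply_tree_depth(struct sketch_walk_info *info,
			    unsigned long depth)
{
	assert(info);
	if (depth != 0)
		return -1;
	info->trees = 0;
	info->blobs = 0;
	return 0;
}

/* Mirrors the short-circuit: root trees matter only if either is set. */
int sketch_needs_root_tree(const struct sketch_walk_info *info)
{
	assert(info);
	return info->trees || info->blobs;
}
```

After a depth-zero filter is applied, `sketch_needs_root_tree()` returns
0, which corresponds to the commit-walk loop never looking up a root
tree.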

Add coverage in t6601 for both `--all` and a single-branch case to
confirm that no trees or blobs are emitted, and a separate test that
`tree:1` is rejected with the expected error message. Place the new
tests before "setup sparse filter blob" so they run on the original set
of refs, before the orphan branch that the sparse-tree tests create.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
The `object:type` filter accepts only objects of a single type; it is
the second member of the object-info-only filter family that bitmap
traversal already supports.

Like `blob:none` and `tree:0`, it can be evaluated with nothing more
than the object's type, which is exactly the granularity path-walk's
existing info->{commits,trees,blobs,tags} flags already control.

Map `LOFC_OBJECT_TYPE` in `prepare_filters()` by AND-ing each flag
against the filtered type. A single `object:type=X` filter
applied to the default info (all flags = 1) leaves `info->X = 1` and
all the others 0, which is what we want.

Using an AND rather than straight assignment prepares us for a
subsequent change to implement combined object filters.
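
A minimal sketch of the AND-based update, using a hypothetical
`sketch_walk_info` struct and `sketch_type` enum in place of the real
`path_walk_info` and object type constants:

```c
#include <assert.h>

/* Hypothetical stand-in for the per-type flags in path_walk_info. */
struct sketch_walk_info {
	int commits, trees, blobs, tags;
};

/* Hypothetical type enum; the real code uses Git's object types. */
enum sketch_type {
	SKETCH_COMMIT,
	SKETCH_TREE,
	SKETCH_BLOB,
	SKETCH_TAG
};

/* AND each flag with "is this the filtered type?" so filters stack. */
void sketch_apply_object_type(struct sketch_walk_info *info,
			      enum sketch_type type)
{
	assert(info);
	info->commits &= (type == SKETCH_COMMIT);
	info->trees &= (type == SKETCH_TREE);
	info->blobs &= (type == SKETCH_BLOB);
	info->tags &= (type == SKETCH_TAG);
}
```

Starting from all flags set, one call leaves exactly one flag on. A
second call with a different type clears that flag too, which is the
empty intersection a straight assignment would not produce.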

The path-walk machinery is mostly already wired for the per-type
distinction:

 - `walk_path()` calls `path_fn` for a batch only when the corresponding
   `info->X` flag is set, so unwanted types are silently not reported.

 - `add_tree_entries()` skips tree entries of type `OBJ_BLOB` when
   `info->blobs` is unset, so we don't even allocate paths for them.

 - The commit-walk loop short-circuits the root-tree fetch when
   `!info->trees && !info->blobs`, so commit-only filters don't descend
   into trees at all.

But there are a couple of side effects of the "trees off, blobs on" case
that need fixing:

 1. `setup_pending_objects()` previously skipped pending trees as soon
    as `info->trees` was zero. For 'object:type=blob' the call site
    needs those pending trees: a lightweight tag pointing to a tree, or
    an annotated tag whose peeled target is a tree, can both reach
    blobs that are otherwise unreachable from any commit's root tree.
    Loosen the gate to "if (!info->trees && !info->blobs) continue" and
    similarly retrieve the root_tree_list whenever either trees or
    blobs are wanted.

 2. The revision machinery's `handle_commit()` drops pending trees when
    `revs->tree_objects` is zero (see the 'OBJ_TREE' handler in
    revision.c), so by the time path-walk sees the pending list
    after `prepare_revision_walk()` the tree-bearing pendings would
    already be gone. Fix this by setting

        revs->tree_objects = info->trees || info->blobs

    so pending trees survive `prepare_revision_walk()` whenever we
    need to walk into them. Path-walk still resets tree_objects to
    zero immediately after `prepare_revision_walk()` returns, so the
    rev-walk itself never enumerates trees redundantly with
    path-walk's own descent.
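
The two fixes reduce to small boolean conditions. The sketch below
restates them with a hypothetical `sketch_flags` struct; the function
names are illustrative and do not exist in path-walk.c.

```c
#include <assert.h>

/* Hypothetical subset of path_walk_info: just the two relevant flags. */
struct sketch_flags {
	int trees, blobs;
};

/* Previous gate: skip pending trees as soon as trees are not wanted. */
int sketch_skip_pending_tree_old(const struct sketch_flags *info)
{
	assert(info);
	return !info->trees;
}

/* Loosened gate: skip only when neither trees nor blobs are wanted. */
int sketch_skip_pending_tree_new(const struct sketch_flags *info)
{
	assert(info);
	return !info->trees && !info->blobs;
}

/* Value for revs->tree_objects while prepare_revision_walk() runs. */
int sketch_tree_objects_for_prepare(const struct sketch_flags *info)
{
	assert(info);
	return info->trees || info->blobs;
}
```

For the "trees off, blobs on" case the old gate skips pending trees
while the new gate keeps them, and the value used during
`prepare_revision_walk()` is nonzero so those pending trees survive.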

Add coverage in t6601 for each of the four `object:type` values. The
'object:type=blob' test in particular asserts that file2 and child/file
(both reachable only through tag-pointed trees) show up in the output,
exercising the pending-tree fix.

Update Documentation/git-pack-objects.adoc to add object:type to
the list of supported --filter forms.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
The `combine` filter takes the intersection of its children, that is:
objects are shown only when all child filters would admit the object.

The preceding patches added support for many individual filter types.
Enable users to compose these filters by implementing support for the
`combine` filter type.

Mapping intersection onto path_walk_info works because every supported
child filter is a monotonic restriction:

 - `blob:none` and `tree:0` unconditionally clear `info->blobs` and (for
   `tree:0`) `info->trees`; clearing an already-cleared flag is a
   no-op.

 - `object:type=X` is now expressed as an AND of each type flag with the
   filtered type, so applying multiple such filters only refines the
   existing set rather than overwriting it.

 - `blob:limit=N` has to compose too: the intersection of "size < L1"
   and "size < L2" is "size < min(L1, L2)".

   Update the `LOFC_BLOB_LIMIT` handler to take the running minimum when
   `info->blob_limit` is already set, so a combined filter with, e.g.,
   both "blob:limit=10" and "blob:limit=5" produces a limit of 5
   regardless of ordering.

 - `sparse:oid` is left unchanged. At most one `sparse:oid` child is
   allowed in a `combine` filter, since the existing handler refuses to
   overwrite `info->pl`. Two `sparse:oid` filters in a single `combine`
   would be unusual and are rejected with a warning, matching the
   standalone `sparse:oid` behavior.
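
The running-minimum rule for `blob:limit` can be sketched as follows.
How the real code records that a limit is "already set" is not shown
here, so the separate `blob_limit_set` flag is an assumption, as are the
`sketch_` names.

```c
#include <assert.h>

/* Hypothetical holder for the limit; the flag is an assumption. */
struct sketch_limit_info {
	int blob_limit_set;
	unsigned long blob_limit;
};

/* Keep the smallest limit seen so far, regardless of filter order. */
void sketch_apply_blob_limit(struct sketch_limit_info *info,
			     unsigned long limit)
{
	assert(info);
	if (!info->blob_limit_set || limit < info->blob_limit)
		info->blob_limit = limit;
	info->blob_limit_set = 1;
}

/* A blob passes if no limit is set or its size is below the limit. */
int sketch_blob_passes(const struct sketch_limit_info *info,
		       unsigned long size)
{
	assert(info);
	return !info->blob_limit_set || size < info->blob_limit;
}
```

Applying limits 10 then 5, or 5 then 10, leaves the same limit of 5 in
this sketch.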

Implementation-wise, the existing `prepare_filters()` called
`list_objects_filter_release()` inside each case branch. That works fine
for top-level filters, but a `combine` filter needs to recurse over its
child filters without releasing each one in turn (since the parent's
release iterates the sub array). Split `prepare_filters()` into a
recursive helper that performs only the mutation, plus a thin wrapper
that calls the helper and then releases the top-level filter once.

The `LOFC_COMBINE` case in the helper just walks `sub_nr` and recurses;
child filters are released by the wrapper's single
`list_objects_filter_release()` call on the parent (which itself
recursively releases each sub-filter, the same way it always has).

If any sub-filter is unsupported (e.g. "tree:1", "sparse:<path>", or a
not-yet-supported choice), the recursion bubbles a failure up and the
existing pack-objects/backfill fallback paths kick in.
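
The helper/wrapper split might look roughly like the sketch below. All
types and names are hypothetical stand-ins rather than the real
list-objects-filter structures, and releasing only on success is an
assumption made here so that a fallback could still use the filter.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_choice {
	SKETCH_BLOB_NONE,
	SKETCH_TREE_DEPTH,
	SKETCH_COMBINE
};

/* Hypothetical filter node; 'released' counts release calls. */
struct sketch_filter {
	enum sketch_choice choice;
	unsigned long tree_depth;
	struct sketch_filter *sub;
	size_t sub_nr;
	int released;
};

struct sketch_info {
	int trees, blobs;
};

/* Recursive helper: mutates 'info' only and never releases a filter. */
int sketch_apply(struct sketch_info *info, const struct sketch_filter *f)
{
	size_t i;

	switch (f->choice) {
	case SKETCH_BLOB_NONE:
		info->blobs = 0;
		return 0;
	case SKETCH_TREE_DEPTH:
		if (f->tree_depth != 0)
			return -1; /* unsupported: bubble the failure up */
		info->trees = 0;
		info->blobs = 0;
		return 0;
	case SKETCH_COMBINE:
		for (i = 0; i < f->sub_nr; i++)
			if (sketch_apply(info, &f->sub[i]) < 0)
				return -1;
		return 0;
	}
	return -1;
}

/* Stand-in release that recurses into sub-filters, parent last. */
void sketch_release(struct sketch_filter *f)
{
	size_t i;

	for (i = 0; i < f->sub_nr; i++)
		sketch_release(&f->sub[i]);
	f->released++;
}

/* Thin wrapper: run the helper, then release the top level once. */
int sketch_prepare_filters(struct sketch_info *info,
			   struct sketch_filter *top)
{
	int ret;

	assert(info && top);
	ret = sketch_apply(info, top);
	if (!ret)
		sketch_release(top); /* assumption: keep it on failure */
	return ret;
}
```

In this sketch each node is released exactly once on success, and a
failing child such as a depth-one tree filter leaves every node
unreleased. Note that `info` may be partially mutated when a later child
fails.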

Add coverage in t6601:

  - "combine:blob:none+tree:0" collapses to "tree:0"

  - "combine:object:type=blob+blob:limit=3" yields only the blobs
    smaller than three bytes

  - "combine:object:type=blob+object:type=tree" intersects to empty

  - "combine:tree:1+blob:none" reports the "tree:1" error.

Update Documentation/git-pack-objects.adoc to add combine to the
list of supported --filter forms.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
@derrickstolee derrickstolee changed the base branch from master to en/backfill-fixes-and-edges May 4, 2026 16:52
@derrickstolee derrickstolee force-pushed the path-walk-filters branch 2 times, most recently from c35b522 to 5423273 Compare May 4, 2026 16:55